FUZZY TEXT CATEGORIZER

BACKGROUND OF THE INVENTION 

The present invention relates generally to a method and apparatus for
determining whether an object containing textual information belongs to a
particular category or categories. In addition, the present invention relates to
the construction of a classifier that automatically determines (i.e., learns)
appropriate parameters for the classifier.

Text categorization (i.e., classification) concerns the sorting of
documents into meaningful groups. When presented with an unclassified
document, electronic text categorization assigns this document to one or
more of these groups of documents. Text categorization can be applied to
documents that are purely textual, as well as to documents that contain both
text and other forms of data such as images.

To categorize a document, the document is transformed into a
collection of related text features (e.g., words) and frequency values. The
frequency of text features is generally only quantified and not qualified (i.e.,
it is not given a linguistically interpretable description). In addition, current
techniques for categorization such as neural nets and support vector machines
use non-intuitive (i.e., black box) methods for classifying textual content in
documents. It would therefore be advantageous to provide a text categorizer
that qualifies the occurrence of a feature in a document and/or provides an
intuitive measure of the manner in which classification occurs.

SUMMARY OF THE INVENTION 

In accordance with the invention, there is provided a text categorizer,
and method therefor, for categorizing a text object into one or more classes.
The text categorizer includes a pre-processing module, a knowledge base,
and an approximate reasoning module. The pre-processing module performs
feature extraction, feature reduction (if necessary), and fuzzy set generation to
represent an unlabeled text object in terms of one or more fuzzy sets. The
approximate reasoning module uses a measured degree of match between
the one or more fuzzy sets and categories represented by fuzzy rules in the
knowledge base to assign labels of those categories that satisfy a selected
decision making rule.

In accordance with one aspect of the invention, a text object is
classified by: extracting a set of features from the text object; constructing a
document class fuzzy set with ones of the set of features extracted from the
text object; each of the ones of the features extracted from the text object
having a degree of membership in the document class fuzzy set and a
plurality of class fuzzy sets of a knowledge base; measuring a degree of
match between each of the plurality of class fuzzy sets and the document
fuzzy set; and using the measured degree of match to assign the text object a
label that satisfies a selected decision making rule.

In accordance with another aspect of the invention, a text object is
classified by: extracting a set of granule features from the text object;
constructing a document granule feature fuzzy set using ones of the granule
features extracted from the text object; each of the ones of the granule
features extracted from the text object having a degree of membership in a
corresponding granule feature fuzzy set of the document granule feature fuzzy
set and a plurality of class granule feature fuzzy sets of a knowledge base;
computing a degree of match between each of the plurality of class granule
feature fuzzy sets and the document granule feature fuzzy set to provide a
degree of match for each of the ones of the granule features; aggregating
each degree of match of the ones of the granule features to define an overall
degree of match for each feature; and using the overall degree of match for
each feature to assign the text object a class label that satisfies a selected
decision making rule.

BRIEF DESCRIPTION OF THE DRAWINGS 

These and other aspects of the invention will become apparent from
the following description read in conjunction with the accompanying drawings
wherein the same reference numerals have been applied to like parts and in
which:

Figure 1 illustrates a classification system with a categorizer for
categorizing an unlabeled text object;

Figure 2 graphically depicts an example of a fuzzy set;

Figure 3 is a general flow diagram of one embodiment for categorizing
an unlabeled text object with a categorizer that uses a class fuzzy set model;

Figure 4 presents a flow diagram of functions performed by the
pre-processing module to produce a fuzzy set from an unlabeled text object,
which is depicted generally at 310 in Figure 3;

Figure 5 illustrates an example of a parse tree of grammar-based
features that could be extracted from the example sentence "Tom read the
book" as well as six possible subtree-based features of noun phrases for "the
book" element of the parse tree;

Figure 6 illustrates an example of a lexicalized parse tree of lexicalized
grammar based features for the example parse tree shown in Figure 5;

Figure 7 illustrates an example of a set of dependency-based features
that are defined using a dependency graph;

Figure 8 depicts a flow diagram of one embodiment for extracting
features from a text object at 402 shown in Figure 4;

Figure 9 illustrates an example normalized frequency distribution,
where the horizontal axis represents each of the text features considered by
the categorizer, and where the vertical axis represents the frequency of
feature fj in the fuzzy set;

Figure 10 illustrates an example of a document fuzzy set for a text
object, where the horizontal axis represents each of the text features
considered, and where the vertical axis denotes the membership value of
feature fj in the fuzzy set;

Figure 11 illustrates an example of granule fuzzy sets for the text
feature "transaction";

Figure 12 illustrates a two-way contingency table of feature, f, and of
category, c;

Figure 13 depicts a flow diagram of an approximate reasoning strategy
that specifies the acts performed at 312 in Figure 3 for assigning a class label
to an unlabeled text object using approximate reasoning;

Figure 14 graphically depicts an example of max-min strategy for a 
class fuzzy set and a document fuzzy set; 

Figure 15 graphically depicts passing a degree of match "x" through a 
filter function S(x) to arrive at an activation value; 

Figure 16 illustrates an alternate embodiment in which a categorizer 
includes a learning system; 

Figure 17 illustrates a detailed flow diagram for constructing a class 
document fuzzy set at 302 in Figure 3; 

Figure 18 illustrates an example of a learned class rule filter; 

Figure 19 is a general flow diagram of another embodiment for 
categorizing an unlabeled text object with a categorizer that uses a granule 
feature model; 

Figure 20 depicts a flow diagram for constructing a document feature 
fuzzy set for a granule feature based model as set forth at 1910 in Figure 19; 

Figure 21 depicts a flow diagram that details the act 1912 shown in 
Figure 19 for assigning a class label to a text object using approximate 
reasoning; and 

Figure 22 depicts a flow diagram for a process that is used to estimate 
each granule fuzzy set at 1902 shown in Figure 19. 

DETAILED DESCRIPTION 

Outline Of Detailed Description: 

1. Overview

2. Class Fuzzy Set Models

2.A Pre-Processing Module

2.B Approximate Reasoning Module

3. Granule Feature Based Models

3.A Constructing Document Feature Fuzzy Sets

3.B Approximate Reasoning Using Granule Feature Based Models

3.C Learning Granule Feature Based Models

4. Converting A Probability Distribution To A Fuzzy Set

5. Miscellaneous 

1. Overview

Figure 1 illustrates a classification system with a categorizer 110 for
categorizing an unlabeled text object or document 112. The categorizer 110
classifies the unlabeled text object 112 into a set of one or more classes 120,
which are also referred to herein as categories. The categorizer 110 includes
a pre-processing module 114 and an approximate reasoning module 118. For
the purpose of disclosing the embodiments of the invention, the document 112
is defined herein to include a single text object. However, it will be appreciated
by those skilled in the art that the document 112 can be defined as having
multiple text objects, where each text object is labeled with a separate set of
one or more categories.

The text pre-processing module 114 includes a feature extractor 115
that takes the unlabeled text object 112 as input and produces a feature
vector 116 as output. The feature vector is the set of features that are used
for learning a text classifier or to classify an unlabeled text object. A feature
reducer 117 limits the features produced by extractor 115 to a predefined
threshold number or range. In an alternate embodiment, the pre-processing
module 114 does not include feature reducer 117 and outputs unmodified
results from the extractor 115. In one embodiment, a fuzzy set generator 121
uses the feature vector 116 to define a document fuzzy set 124 that
represents the text object. In an alternate embodiment, the feature vector 116
is output by the pre-processing module 114 directly to the approximate
reasoning module 118. As defined herein, a fuzzy set is a mapping from a
universe of values to a unit interval (e.g., unit interval [0, 1]) (i.e., each
element in the universe over which a variable is defined is assigned a
membership value, which can be interpreted as the degree of prototypicality
of this value to the concept being represented by the fuzzy set). Consider an
example graphically depicted in Figure 2 for a linguistic variable representing
temperature in Celsius. The example provides that if the linguistic variable for
temperature is defined on a universe consisting of the interval [0, 40], then the
fuzzy set Warm in that universe could be
defined as follows:

$$
\mu_{\mathrm{Warm}}(x) =
\begin{cases}
0 & \text{if } x < 0 \\
\dfrac{x}{10} & \text{if } 0 \le x < 10 \\
1 & \text{if } 10 \le x \le 30 \\
\dfrac{40 - x}{10} & \text{if } 30 < x \le 40 \\
0 & \text{if } x > 40
\end{cases}
$$

where $\mu_{\mathrm{Warm}}(x)$ denotes the membership value of the value x for
temperature in the Warm fuzzy set. In this case, temperature values in the
interval [10, 30] have a membership value of 1 and correspond to the core of
the fuzzy set. Values in the intervals [0, 10] and [30, 40] have membership
values in the range [0, 1], while other values in the universe have membership
values of zero in this definition of the concept of Warm temperature.
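
The Warm example can be checked numerically. The following is a minimal Python sketch of the piecewise membership function; the function name `warm_membership` is illustrative, and the choice of which boundary points are closed is an assumption made to agree with the statement that the core is the interval [10, 30].

```python
def warm_membership(x: float) -> float:
    """Membership of temperature x (Celsius) in the fuzzy set Warm."""
    if x < 0 or x > 40:
        return 0.0               # outside the support of Warm
    if x < 10:
        return x / 10.0          # rising edge on [0, 10)
    if x <= 30:
        return 1.0               # core of the fuzzy set, [10, 30]
    return (40.0 - x) / 10.0     # falling edge on (30, 40]


for temp in (5, 20, 35, 45):
    print(temp, warm_membership(temp))
```

A temperature of 5 or 35 degrees is only partially Warm, while 20 degrees lies in the core.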

Referring again to Figure 1, the approximate reasoning module 118
computes the degree of similarity (i.e., degree of match) between the
unlabeled text object 112 that is represented in terms of the feature vector
116 or the document fuzzy set 124 and one or more categories 120. The
approximate reasoning module 118, which contains matching, filtering and
decision making mechanisms, accesses a knowledge base 122 (i.e., rule
base) to classify the unlabeled text object 112. In a first embodiment, the
knowledge base 122 contains fuzzy rules (also referred to as rules) for each
class (i.e., category), where each rule is made up of a class fuzzy set and an
associated class filter. In a second embodiment, each rule is made up of a
granule fuzzy set, the details of which are presented in section 3 below.

During operation of the first embodiment, the approximate reasoning
module 118: (1) calculates the degree of match between the document fuzzy
set 124 and a fuzzy set associated with each class (i.e., each class fuzzy set);
(2) passes the resulting degree of match through a respective filter function
(i.e., class filter); and (3) determines a class label to assign to the unlabeled
text object based upon the filtered degrees of match (e.g., the class label
associated with the highest degree of match is assigned to be the class label
of the text object).
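
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the matching measure is assumed to be a max-min match (the strategy named in the description of Figure 14), the class filter is a placeholder identity function, and the names `max_min_match`, `classify`, and the example rules are hypothetical.

```python
from typing import Callable, Dict, Tuple

FuzzySet = Dict[str, float]                      # feature -> membership in [0, 1]
Rule = Tuple[FuzzySet, Callable[[float], float]]  # (class fuzzy set, class filter)


def max_min_match(class_set: FuzzySet, doc_set: FuzzySet) -> float:
    """Assumed match measure: max over shared features of the min membership."""
    shared = class_set.keys() & doc_set.keys()
    return max((min(class_set[f], doc_set[f]) for f in shared), default=0.0)


def classify(doc_set: FuzzySet, rules: Dict[str, Rule]):
    """Match, filter, then pick the label with the highest filtered match."""
    scores = {
        label: class_filter(max_min_match(class_set, doc_set))
        for label, (class_set, class_filter) in rules.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


def identity(x: float) -> float:
    """Placeholder filter; a learned class filter would replace this."""
    return x


rules = {
    "finance": ({"bank": 1.0, "loan": 0.8}, identity),
    "sports": ({"goal": 1.0, "match": 0.7}, identity),
}
document = {"bank": 0.9, "loan": 1.0, "match": 0.4}
label, scores = classify(document, rules)
print(label, scores)
```

Here the decision rule is the "highest filtered degree of match" example given in the text; other decision rules could be substituted in the last step of `classify`.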

Figure 3 presents a general flow diagram for categorizing an unlabeled
text object 112 with the categorizer 110. Initially, in accordance with one
embodiment of the invention using the categorizer 1600 shown in Figure 16, a
fuzzy set is learned for each class of document at 300. Generally, learning
class fuzzy sets includes: constructing class fuzzy sets at 302; learning a filter
for each class fuzzy set at 304; and generating a class fuzzy knowledge base
at 306. Learning is discussed in more detail in section 2.C.2 with reference to
Figures 16-18. In an alternate embodiment using the categorizer 110 shown in
Figure 1, class fuzzy sets, rules, and filters are predefined and stored as part
of knowledge base 122.

Subsequently, at 308, acts 310 and 312 are repeated for each new
document 112 consisting of at least one unlabeled text object that is input to
the categorizer 110 or the categorizer 1600. At 310, the pre-processing
module 114 constructs a document fuzzy set for the unlabeled text object 112,
which operations are described in detail in Figures 4 and 8. Subsequently, at
312, given the (pre-existing or learned) knowledge base 122, the approximate
reasoning module 118 assigns class label(s) to the unlabeled text object 112
to categorize the document, which operations are described in detail in Figure
13.

2. Class Fuzzy Set Models 

This section describes the first embodiment of the present invention in
greater detail while referring to Figures 4-18.

2.A Pre-Processing Module

The purpose of the text pre-processor 114 is to transform the document
or text object 112 into a representation that enables the approximate
reasoning module 118 to perform the task of document classification in an
accurate, automatic, efficient and effective manner. In this first embodiment,
the transformation process can be decomposed into the following three
subtasks: feature extraction 115; feature reduction 117; and fuzzy set
generation 121.

To generate the feature vector 116 from unlabeled text object 112, the
pre-processing module 114 may embody a number of components such as
text converters (e.g., HTML to text, PDF to text, postscript to text), a tokenizer,
a stemmer, a word frequency analyzer, a grammar feature analyzer, and/or a
noun phrase analyzer (or extractor). A commercial application that provides
functions similar to those performed by the pre-processing module 114 is
Thingfinder offered by Inxight Software, Inc.

Figure 4 presents a flow diagram of functions performed by the
pre-processing module 114 to produce a fuzzy set from an (unlabeled) text
object 112, which is depicted generally at 310 in Figure 3. At 402, features are
extracted from the unlabeled text object 112 by feature extractor 115. At 404,
the extracted features are optionally filtered by feature reducer 117 to reduce
the number of features extracted at 402 by a predefined threshold number or
range.

Using the reduced or remaining features, the fuzzy set generator 121
uses the feature vector 116 produced by feature reducer 117 to generate
document fuzzy sets 124 at 405. Act 405 includes performing acts 406, 408,
and 410 in which the fuzzy set generator: identifies the frequency of
occurrence for each remaining feature (at 406); normalizes the calculated
frequency of occurrence for each remaining feature to define a frequency
distribution over the remaining features (at 408); and transforms the
normalized frequency distribution defined over the remaining features using
the bijective transformation, as described below in section 4, to define a
document fuzzy set (at 410).

2.A.1 Feature Extraction

The feature extractor 115 generates a feature vector from a document
(i.e., a text object). The generated feature vector consists of tuples (i.e., a set
of ordered elements) of the form <featureID, value>, where "featureID"
corresponds to a unique identifier that denotes a feature (e.g., the characters
that make up a word such as "Xerox" in the case of word features or a unique
number that can be used to retrieve the feature when stored in a database or
hash table) and where "value" can be a real number, a binary value, a fuzzy
set, or a probability distribution (e.g., the frequency of occurrence of the word
feature "Xerox" in a text object). A value can be associated with a featureID
using: normalized frequency distributions; smoothed frequency distributions;
fuzzy sets; and granule fuzzy sets.

An example of a feature is a word feature. The word feature is identified
by a featureID that identifies a selected word. The value associated with each
word feature is the number of times that selected word appears in
the text object. More generally, document features produced from a text object
by the pre-processing module 114 may include: (a) a set of words; (b) a set of
noun phrases; (c) a set of identified entities such as people names, product
names, dates, times, book titles, etc.; (d) a set of grammar-based features (e.g.,
Figures 5 and 6); (e) a set of dependency-based features (e.g., Figure 7); or
(f) combinations of (a)-(e), collectively referred to herein as a set of features.

Figures 5 and 6 illustrate parse trees 500 and 600, respectively, of
grammar-based features that could be extracted from the example sentence
"Tom read the book". Each parse tree uses the following grammatical notation:
"S" to represent a sentence; "NP" to represent a noun phrase; "D" to represent
a determiner; "V" to represent a verb; and "VP" to represent a verb phrase.

More specifically, Figure 5 illustrates six possible subtree-based
features at 504 of noun phrases for "the book" element 502 of the parse tree
500. Words that make up features extracted from the sentence could be
stemmed using known techniques such as Porter's stemming algorithm, as
described by Porter in "An Algorithm For Suffix Stripping", Program, Vol. 14,
No. 3, pp. 130-137, 1980 (herein referred to as "Porter's stemming algorithm").
In contrast, Figure 6 illustrates a lexicalized parse tree 600, with lexicalized
grammar based features defined as shown for the parse tree 500 in Figure 5.
However, unlike the parse tree 500 shown in Figure 5, the lexicalized parse
tree 600 shown in Figure 6 also includes the word, not just grammar notations.
That is, each phrasal node in the lexicalized parse tree 600 is marked by its
head-word (e.g., V (verb) read).

Figure 7 illustrates an example of a set of dependency-based features
that are defined using a dependency graph 700. The dependency graph 700
represents the sentence "Tom read the book" using nodes and arcs. Each
node represents a word in the sentence and each arc signifies a dependency
from a head word to a modifier. The arcs are labeled with a grammatical
relationship, where in this example "sbj" denotes a subject, "obj" denotes
an object, and "dt" denotes a determiner. Features 702 and 704 represent
two dependency-based features that can be extracted from the dependency
graph 700 at 701 and 703, respectively.

2.A.1.a Basic Feature Extraction

Figure 8 depicts a flow diagram of one embodiment for extracting
features from a text object at 402 shown in Figure 4. Although the description
of feature extraction is limited to word feature extraction, it will be appreciated
by those skilled in the art that a similar approach can be used to extract other
types of features described above. In accordance with this embodiment, a
document is transformed into a list of tokens, or word list, (at 802) in which
tokens are delimited by spaces and punctuation characters. Tokens that
correspond to stop words (i.e., words that do not improve the quality of the
categorization such as "the", "a", "is", etc.) are subsequently eliminated from
this list of tokens (at 804). The remaining tokens in the list are then stemmed
using Porter's stemming algorithm (at 806). Subsequently, stop words are
removed from the stemmed word list (at 807), resulting in a list of words.

Finally, this list of words is transformed to a frequency distribution
consisting of <term, frequency> tuples where frequency denotes the number
of occurrences of that feature in the document to define a set of features (at
808, 810, 812, 814, 816, 818, and 820). A word list count initialized at 808 is
used to index into each word (i.e., feature) in the word list. If the feature is not
located at 812 in the feature database, then the feature or unique identifier
associated with the feature (i.e., featureID) is inserted into the feature
database and its frequency is set to zero at 814. If the feature already exists in
the feature database or has just been initialized at 814, then the
frequency of that feature is incremented by one at 816. To process the next
word, the word list count is incremented by one at 818. This counting process
terminates at 820 once all the text segments have been processed.
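
The word feature extraction flow of Figure 8 can be sketched in Python as follows. This is a simplified illustration: the stop word list is a small hypothetical sample, and `crude_stem` is a toy suffix stripper standing in for Porter's stemming algorithm, which is considerably more involved. The numbers in the comments refer to the acts described above.

```python
import re
from collections import Counter

# Illustrative stop word list only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to"}


def crude_stem(token: str) -> str:
    """Toy suffix stripper (an assumption), NOT Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_word_features(text: str) -> Counter:
    """Return <term, frequency> pairs for a text object."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # 802: tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 804: drop stop words
    stems = [crude_stem(t) for t in tokens]                # 806: stem
    stems = [s for s in stems if s not in STOP_WORDS]      # 807: drop stop words again
    counts = Counter()
    for stem in stems:                                     # 808-820: count occurrences
        counts[stem] += 1                                  # unseen stems start at zero
    return counts


features = extract_word_features("Tom read the book. The books are read.")
print(dict(features))
```

A `Counter` plays the role of the feature database here: looking up an unseen feature yields zero, which corresponds to initializing its frequency at 814 before incrementing it at 816.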

Different combinations of known techniques from natural language
processing such as translation of HTML to text, tokenization, stemming, stop
word removal, and entity recognition can be used to identify a set of features.
In an alternate embodiment grammar based features are identified at 802. In
this alternate embodiment, in addition to tokenization of the document into
words as delimited by white spaces or punctuation characters, supplemental
parsing is performed to identify grammatical features (e.g., noun phrases),
and/or various other syntactical structures such as parse trees or
parse-subtrees, or string subsequences (as disclosed by Cristianini et al., in "An
Introduction To Support Vector Machines", Cambridge University Press,
Cambridge, UK, 2000, Section 3.3.3). It will therefore be appreciated by those
skilled in the art that multiple types of features may be assigned to the document
and need not be limited to one type (e.g., word features) as described in the
embodiment shown in Figure 8.

Also, rather than incrementing the value associated with a feature by
one at 816 when the occurrence of a feature is located in a text segment, the
value could be incremented in a more intuitive manner, such as an increment
value that is defined using an aggregation of a number of measures such as
the frequency of each feature in the document, its location in a document, the
frequency of each feature (e.g., term) in a reference corpus, and the inverse
document frequency of the term as described by Bordogna, et al., in "A
Linguistic Modeling Of Consensus In Group Decision Making Based On OWA
Operators", IEEE Transactions on Systems, Man and Cybernetics 27 (1997),
pp. 126-132. The textbook by Manning and Schutze, "Foundations Of
Statistical Natural Language Processing", published in 1999, MIT Press,
Cambridge, MA, provides additional background relating to text
pre-processing performed at 816.

2.A.1.b Fuzzy Set Feature Extraction 

In one embodiment, the acts which are set forth at 405 in Figure 4 are
performed to create a document fuzzy set. In accordance with this
embodiment, after computing a feature value (e.g., the frequency of
occurrence of each feature) identified at 406, the frequency values associated
with each feature are normalized such that all feature values sum to one. This
process results in a frequency distribution consisting of tuples <featureID,
value> with a tuple for each feature that is extracted.

Figure 9 illustrates an example frequency distribution, where the
horizontal axis represents each of the features (e.g., words) considered by the
categorizer, and where the vertical axis denotes the frequency (normalized or
otherwise) of feature fj in the fuzzy set. There is no explicit order to the
features (e.g., words) on the horizontal axis, though each feature can be
thought of as being associated with/indexed by a unique key fj and ordered
according to that key.

In the embodiment with learning shown in Figure 16, each feature value
can be estimated during training (i.e., construction of a text classifier) using
the m-estimate as follows (for more background see Mitchell in Machine
Learning, McGraw-Hill, New York, 1997):

$$
\mathit{value}(\mathit{featureID}) =
\frac{\mathit{Freq}(\mathit{featureID}, \mathit{Doc}) + 1}{|\mathit{Doc}| + |\mathit{Vocab}|}
$$

where Freq(featureID, Doc) denotes the number of occurrences of the feature
(i.e., featureID) in the text object Doc; |Vocab| denotes the number of unique
features considered as the language of the model (i.e., the number of
variables used to solve the problem); and |Doc| denotes the length of the text
object in terms of all features considered.
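
As a worked illustration of the m-estimate, the following Python sketch computes the smoothed value for every feature in a vocabulary. The function name `m_estimate` and the example counts are hypothetical; |Doc| is taken to be the total count of vocabulary features in the text object, which is one reading of "the length of the text object in terms of all features considered".

```python
from typing import Dict, Set


def m_estimate(counts: Dict[str, int], vocab: Set[str]) -> Dict[str, float]:
    """Smoothed feature values: (Freq + 1) / (|Doc| + |Vocab|)."""
    doc_length = sum(counts.get(f, 0) for f in vocab)   # |Doc|
    denominator = doc_length + len(vocab)               # |Doc| + |Vocab|
    return {f: (counts.get(f, 0) + 1) / denominator for f in vocab}


counts = {"bank": 3, "loan": 1}
vocab = {"bank", "loan", "goal", "match"}
values = m_estimate(counts, vocab)
print(values)
```

With these counts, |Doc| = 4 and |Vocab| = 4, so "bank" receives 4/8, "loan" 2/8, and the two unseen features 1/8 each. The smoothed values still sum to one, and no feature in the vocabulary is assigned a value of zero.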

Referring again to Figure 4 at 410, a text object can be represented as
a discrete fuzzy set consisting of tuples of the form <featureID,
$\mu_{\mathrm{TextObject}}(\mathit{featureID})$> where featureID is a text feature as
described above and $\mu_{\mathrm{TextObject}}(\mathit{featureID})$ denotes the membership
value (or degree of prototypicality) of the feature for the document fuzzy set.
An example of a document fuzzy set for a text object is depicted in Figure 10,
where the horizontal axis represents each of the features (e.g., words)
considered in the system (having no explicit order, though each feature can be
thought of as being associated with/indexed by a unique key fj and ordered
according to that key), and where the vertical axis denotes the membership
value of feature fj in the fuzzy set. This document fuzzy set is derived from a
normalized frequency distribution of occurrence for each feature using the
bijective transformation process described in section 4 below.
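
To make the step from a normalized frequency distribution to membership values concrete, the following Python sketch applies one well-known bijective probability-to-possibility transformation, in which the membership of a feature is the sum over all features of the smaller of the two probabilities. Whether this is exactly the transformation of section 4 is an assumption; section 4 remains the authoritative definition. The function name `bijective_transform` and the example distribution are illustrative.

```python
from typing import Dict


def bijective_transform(dist: Dict[str, float]) -> Dict[str, float]:
    """Assumed transformation: mu(f) = sum over g of min(p(f), p(g))."""
    return {
        f: sum(min(p, q) for q in dist.values())
        for f, p in dist.items()
    }


# Normalized frequency distribution (values sum to one).
distribution = {"bank": 0.5, "loan": 0.25, "goal": 0.125, "match": 0.125}
document_fuzzy_set = bijective_transform(distribution)
print(document_fuzzy_set)
```

Under this transformation the most frequent feature always receives a membership value of 1, and less frequent features receive proportionally smaller values, so the ordering of the features by frequency is preserved.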

2.A.1.c Granule Fuzzy Set Feature Extraction

In another embodiment (also referred to herein as the second
embodiment and described in detail in section 3), feature values defined over
the normalized frequency domain of the zero to one interval [0, 1] are
represented as granule fuzzy sets. A granule fuzzy set is a set of granules and
corresponding membership values, where each granule is represented by a
fuzzy set and an associated (word) label. Background on granule fuzzy sets is
disclosed by Shanahan in "Soft Computing For Knowledge Discovery:
Introducing Cartesian Granule Features", Kluwer Academic Publishers,
Boston, 2000 (hereinafter referred to as "Soft Computing by Shanahan"),
which is incorporated herein by reference.

For example, granule fuzzy sets for frequency variables are illustrated
in Figure 11 using an example frequency variable for the word feature
"transaction". Generally when constructing a granule fuzzy set, the universe
over which a variable is defined is partitioned using a plurality of fuzzy sets.
This is graphically depicted in Figure 11 for the frequency variable associated
with the word feature "transaction". In this example, the plurality of fuzzy sets
used to partition the frequency universe of the word feature "transaction" is
labeled using the following list of word labels in the partition:

P_transaction = {Small, Medium, High}.

This fuzzy partition can be constructed using a number of approaches
such as clustering, or percentile-based partitioning as outlined in Soft
Computing by Shanahan. One type of partition can be used for all features, or
a specific fuzzy partition can be used for each feature. Having constructed a
fuzzy partition of the feature's frequency universe, the granules denoted by
word labels that make up the partition are used to interpret the frequency
value.

Using this fuzzy partition, the frequency extracted for the transaction
feature is linguistically re-interpreted. This corresponds to calculating the
degree of membership of the frequency of transaction in each of the fuzzy
sets that partition the domain of the frequency variable. This process is
depicted in the example shown in Figure 11 for the frequency of 0.08. It
results in the following granule fuzzy set or linguistic description of 0.08:
{Small/0.3 + Medium/1.0}. This granule fuzzy set can be interpreted as saying
that Medium is a more prototypical label (description) for a normalized
frequency of 0.08 than Small.
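
The linguistic re-interpretation of a frequency value can be sketched in Python as follows. The trapezoidal shapes and their breakpoints in `PARTITION` are assumptions chosen only so that the sketch reproduces the description {Small/0.3 + Medium/1.0} given above for 0.08; the actual partition is the one drawn in Figure 11, which may differ.

```python
def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Membership rises on [a, b], equals 1 on [b, c], and falls on [c, d]."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical fuzzy partition of the frequency universe [0, 1].
PARTITION = {
    "Small": (0.0, 0.0, 0.01, 0.11),
    "Medium": (0.01, 0.06, 0.30, 0.50),
    "High": (0.30, 0.50, 1.0, 1.0),
}


def granule_fuzzy_set(frequency: float) -> dict:
    """Describe a frequency by its membership in each labelled granule."""
    memberships = {
        label: round(trapezoid(frequency, *params), 2)
        for label, params in PARTITION.items()
    }
    return {label: m for label, m in memberships.items() if m > 0.0}


description = granule_fuzzy_set(0.08)
print(description)
```

Granules with zero membership (High in this case) are omitted from the description, mirroring the notation used in the text.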

2.A.2 Feature Reduction

The feature extractor 115 described above when applied to a text
object can potentially lead to a feature vector consisting of millions of features.
To make fuzzy set formation more feasible at 405 in Figure 4 or at 300 in
Figure 3, the feature vector extracted from a text object at 402 may be
reduced to fewer elements or a predefined number of elements using feature
reducer 117 shown in Figure 1. Various techniques can be used to enable
feature reduction at 404 in Figure 4.

In one embodiment for filtering extracted features at 404, a database
(i.e., corpus) of example text documents is used to perform feature reduction.
Typically this corpus forms all or a subset of training database 1604 shown in
Figure 16 that is used to learn the knowledge base 122 in categorizer 1600. In
this section it is assumed that the training database 1604 is used for feature
reduction. It will be appreciated by those skilled in the art that the training
database can form part of the feature reducer 117 or alternatively be separate
from the categorizer as shown in Figure 16.

More formally, the training database 1604 is a collection of labeled
documents consisting of tuples <Di, Li> where Di denotes a document, and Li
denotes a class label associated with the document Di. Feature reduction can
be applied at two levels of granularity: corpus-based feature reduction; and
category-based feature reduction. Processes for performing each type of
feature reduction are described below. In one embodiment, corpus-based
feature reduction is applied first and subsequently category-based feature
reduction is applied to the feature vector produced by the feature extractor
115.

2.A.2.a Corpus-Based Feature Reduction 

The process of corpus-based feature reduction begins by merging all
training documents into one text object. Subsequently, feature extraction is
applied to this text object. Having extracted the features and corresponding
frequency values for the text object, Zipf's law and/or latent semantic indexing
(LSI) can be applied to select those features that are most important, thereby
reducing the overall number of extracted features.

Zipf's law, which is known in the art, concerns the distribution of
different words in a text object and states that the product of a feature's
frequency, f, (while generalizing Zipf's law to include text features as well as
words) in a text object, and its rank, r, is a constant, c (i.e., f × r = c). In
accordance with Zipf's law, words having a low frequency will not be useful in
classifying new items. In addition, words that have a high frequency typically
occur in all classes and thus will not help in discriminating between classes.
Consequently, a feature vector is reduced using Zipf's law by eliminating
features that occur frequently or very rarely.
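
A minimal Python sketch of this kind of Zipf-based reduction follows. The cut-offs are assumptions for illustration: features below a minimum count `low` are treated as too rare, and the `top_ranks` highest-ranked features are treated as too frequent. The text does not specify particular thresholds, and the function name `zipf_reduce` is hypothetical.

```python
from collections import Counter
from typing import Dict


def zipf_reduce(counts: Counter, low: int = 2, top_ranks: int = 1) -> Dict[str, int]:
    """Drop very rare features and the most frequent (top-ranked) features."""
    ranked = [feature for feature, _ in counts.most_common()]  # rank 1 first
    too_common = set(ranked[:top_ranks])
    return {
        feature: count
        for feature, count in counts.items()
        if count >= low and feature not in too_common
    }


corpus_counts = Counter({"said": 50, "bank": 7, "loan": 4, "zygote": 1})
reduced = zipf_reduce(corpus_counts)
print(reduced)
```

In this example "said" is removed as the top-ranked feature and "zygote" is removed because it occurs only once, leaving the mid-frequency features that are more likely to discriminate between classes.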

In an alternative embodiment, these Zipf-based selection criteria are
applied to each document in the training database 1604 to eliminate features
that have low or high frequency within each document in the training database
1604. The reduced feature set of each document in the training database
1604 is formed by taking the set union of the selected features, thereby
creating a reduced feature training database. Zipf-based reduction could once
again then be applied to this union of features.

In yet another embodiment, LSI is used to translate the extracted
feature space into a latent concept space that can be used to explain the
variance-covariance structure of a set of features through linear
combinations of these features. Having identified the latent concepts, also
known as latent features, in a feature space, they can be used directly as
input for the categorization process instead of the extracted features.
Typically only the top n latent features are selected, whereby the latent
features are ordered in decreasing order of the amount of variance that each
explains. Alternatively, only a minimum subset of the latent features that
explains a given percentage of the variance is used. The details of LSI are
disclosed by Deerwester, Dumais, Furnas, Landauer, and Harshman, in
"Indexing By Latent Semantic Analysis", Journal of the American Society for
Information Science, 41(6), 391-407, 1990.
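
The two selection rules for latent features might be sketched as follows.
This sketch assumes that the singular values of the feature matrix have
already been computed by some decomposition routine, and that the variance
explained by a latent feature is proportional to its squared singular value;
both are assumptions of this example, as is the function name.

```python
def count_latent_features(singular_values, top_n=None, variance_fraction=None):
    """Return how many leading latent features to keep.

    Either keep a fixed top_n, or keep the smallest number of latent
    features whose cumulative variance reaches variance_fraction.
    """
    # Assumption: variance explained is proportional to the squared singular value.
    variances = sorted((s * s for s in singular_values), reverse=True)
    if top_n is not None:
        return min(top_n, len(variances))
    total = sum(variances)
    cumulative = 0.0
    for k, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative >= variance_fraction * total:
            return k
    return len(variances)


keep_fixed = count_latent_features([3.0, 2.0, 1.0], top_n=1)
keep_by_variance = count_latent_features([3.0, 2.0, 1.0], variance_fraction=0.9)
```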

2.A.2.b Category-Based Feature Reduction 

Category-based feature reduction reduces features by examining the
separation or discrimination properties provided by class features. One or
more of the techniques described below can be used to reduce the number of
features in this way: (1) a mutual information technique; and (2) a
semantic separation technique for granule fuzzy sets. In addition, other
known approaches that are not discussed here, such as a chi-square
technique, an information gain technique that measures the number of bits
of information obtained, and a technique using a correlation coefficient,
can be used to reduce the number of features in a similar way.

2.A.2.b.1 Mutual Information 

The mutual information (MI) technique measures the extent to which a
text feature is associated with a text category. Features that exhibit high
values for mutual information are good discriminators and should therefore
be selected, whereas low values (zero or close to zero) indicate that a
feature and class are independent; such features will not contribute to a
good class definition and should be eliminated. In an exemplary embodiment,
the mutual information for a given feature, f, in a given category, c, is
calculated as follows:

\[
\begin{aligned}
MI(f,c) ={} & \Pr(f^+,c^+)\log\frac{\Pr(f^+,c^+)}{\Pr(f^+)\Pr(c^+)}
            + \Pr(f^+,c^-)\log\frac{\Pr(f^+,c^-)}{\Pr(f^+)\Pr(c^-)} \\
          & + \Pr(f^-,c^+)\log\frac{\Pr(f^-,c^+)}{\Pr(f^-)\Pr(c^+)}
            + \Pr(f^-,c^-)\log\frac{\Pr(f^-,c^-)}{\Pr(f^-)\Pr(c^-)},
\end{aligned}
\]



where: Pr(f^+, c^+) denotes the probability that a document has feature f
and belongs to category c; Pr(f^+, c^-) denotes the probability that a
document has feature f and does not belong to category c; Pr(f^-, c^+)
denotes the probability that a document does not have feature f and belongs
to category c; and Pr(f^-, c^-) denotes the probability that a document does
not have feature f and does not belong to category c.

For example, Figure 12 illustrates a two-way contingency table of
feature, f, and of category, c. Each document exhibits one of the following
four characteristics: (i) the document contains feature f and belongs to
category c (denoted as f^+, c^+); (ii) the document contains feature f and
does not belong to category c (denoted as f^+, c^-); (iii) the document does
not contain feature f and belongs to category c (denoted as f^-, c^+); and
(iv) the document does not contain feature f and does not belong to
category c (denoted as f^-, c^-).

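If: A denotes the number of documents exhibiting characteristic f^+, c^+; B
denotes the number of documents exhibiting characteristic f^+, c^-; C
denotes the number of documents exhibiting characteristic f^-, c^+; D
denotes the number of documents exhibiting characteristic f^-, c^-; and N is
the number of documents in the training database, then Pr(f^+, c^+) = A/N,
Pr(f^+) = (A+B)/N, Pr(c^+) = (A+C)/N, and so on, and the mutual information
MI(f,c) of feature "f" for a class "c" can be rewritten as follows:

\[
\begin{aligned}
MI(f,c) ={} & \frac{A}{N}\log\frac{A\,N}{(A+B)(A+C)}
            + \frac{B}{N}\log\frac{B\,N}{(A+B)(B+D)} \\
          & + \frac{C}{N}\log\frac{C\,N}{(C+D)(A+C)}
            + \frac{D}{N}\log\frac{D\,N}{(C+D)(B+D)}.
\end{aligned}
\]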

Given a training database, mutual information is computed for each
feature in the input feature vector using the approach presented above.
Subsequently, the features whose mutual information measure is less than a
predetermined threshold are removed from the input feature vector.
Alternatively, if the number of features is to be reduced to a predetermined
number, n, then the input feature vector is reduced to the n features with
the highest mutual information.
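
A minimal sketch of this selection step, computed from the document counts
A, B, C, and D of the contingency table, is given below. The natural
logarithm, the convention that a cell with a zero count contributes nothing,
and the function names are assumptions of this example.

```python
import math


def mutual_information(a, b, c, d):
    """Mutual information of a feature and a category from contingency counts.

    a: has feature, in category        b: has feature, not in category
    c: lacks feature, in category      d: lacks feature, not in category
    """
    n = a + b + c + d
    cells = [
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    ]
    total = 0.0
    for joint, feature_margin, class_margin in cells:
        if joint > 0:  # assumption: 0 * log(0) is treated as 0
            total += (joint / n) * math.log(joint * n / (feature_margin * class_margin))
    return total


def top_n_features(counts_by_feature, n):
    """Keep the n features with the highest mutual information."""
    ranked = sorted(
        counts_by_feature,
        key=lambda f: mutual_information(*counts_by_feature[f]),
        reverse=True,
    )
    return ranked[:n]


counts = {"goal": (50, 0, 0, 50), "the": (25, 25, 25, 25), "team": (40, 10, 10, 40)}
selected = top_n_features(counts, 2)
```

Here the feature "the" is statistically independent of the category, so its
mutual information is zero and it is the first feature to be discarded.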

2.A.2.b.2 Semantic Discrimination Analysis 

Semantic discrimination analysis examines the discriminating power of
a granule feature by examining the mutual dissimilarity of classes as
represented by granule fuzzy sets defined over these features. To calculate
the semantic discrimination of a granule feature, the granule fuzzy sets
corresponding to each text category need to be constructed. In one
embodiment, this is accomplished using the bijective transformation process
presented in section 4 below for converting a probability distribution (as
generated by the feature extraction module 115) to a fuzzy set.

Subsequently, a known probabilistic reasoning strategy based upon
semantic discrimination analysis is used to determine the mutual
dissimilarity of class fuzzy sets, as measured in terms of the point
semantic unifications between each granule fuzzy set FS_k and the other
class fuzzy sets FS_j. This discrimination (i.e., Discrim) is formally
defined as follows:

\[
\mathrm{Discrim} = \min_{k=1}^{c}\left(1 - \max_{j=1,\; j \neq k}^{c} \Pr(FS_k \mid FS_j)\right)
\]

where:

\[
\Pr(FS_k \mid FS_j) = \sum_{i=1}^{n} \mu_{FS_k}(f_i) \times \Pr\nolimits_{FS_j}(f_i),
\]

such that Pr(FS_k | FS_j) denotes the semantic match between fuzzy sets FS_k
and FS_j in terms of point semantic unification; Pr_FSj(f_i) denotes the
probability distribution obtained from the fuzzy set FS_j using the process
presented in section 4 below; μ_FSk(f_i) denotes the membership value of f_i
in the fuzzy set FS_k; c corresponds to the number of classes in the
current system; and i ranges from 1 to n, where n corresponds to the number
of features used.
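
The discrimination measure might be sketched as follows. The sketch assumes
that each class fuzzy set and the probability distribution derived from it
(by the process of section 4, which is not reproduced here) are already
available as dictionaries; the function names and the numeric values are
purely illustrative.

```python
def semantic_match(fuzzy_set_k, distribution_j):
    """Point semantic unification Pr(FS_k | FS_j)."""
    return sum(
        membership * distribution_j.get(element, 0.0)
        for element, membership in fuzzy_set_k.items()
    )


def discrimination(class_fuzzy_sets, class_distributions):
    """Minimum over classes k of 1 minus the best match with any other class j."""
    scores = []
    for k, fuzzy_set_k in class_fuzzy_sets.items():
        best_other = max(
            semantic_match(fuzzy_set_k, distribution_j)
            for j, distribution_j in class_distributions.items()
            if j != k
        )
        scores.append(1.0 - best_other)
    return min(scores)


# Hypothetical class fuzzy sets and their associated probability distributions.
fuzzy_sets = {
    "sport": {"low": 1.0, "high": 0.2},
    "politics": {"low": 0.3, "high": 1.0},
}
distributions = {
    "sport": {"low": 0.9, "high": 0.1},
    "politics": {"low": 0.15, "high": 0.85},
}
score = discrimination(fuzzy_sets, distributions)
```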

Given a training database, a semantic discrimination measure is
computed for each feature in the input feature vector using the approach
presented above. Subsequently, the features whose semantic discrimination
measure is less than a predetermined threshold are removed from the input
feature vector. Alternatively, if the number of features is to be reduced to
a predetermined number, n, then the input feature vector is reduced to the n
features with the highest semantic discrimination.

2.B Approximate Reasoning Module

This section describes a first embodiment of the approximate reasoning
module 118 that uses feature-based models to classify a text object. The
approximate reasoning module 118 includes matching, filtering, and decision
making mechanisms, which are described below in sections 2.B.1, 2.B.2, and
2.B.3, respectively.

Generally, text classification performed by the approximate reasoning
module 118 can be represented as a functional mapping from a set of text
features described above, {f_1, ..., f_n}, to a set of class values
{c_1, ..., c_c}. Typically each class is represented as a rule that
describes the mapping from the features to a class. These rules form part
of knowledge base 122.

More specifically, the approximate reasoning module 118 uses the
class rules (i.e., class fuzzy sets and class filters) in knowledge base
122 to classify an unlabelled text object 112. Viewed as such, a text
classification problem can be represented using a canonical knowledge base
of the form:

r_1: IF filter_1 (DocumentFuzzySet is FuzzySetForClass_1) THEN
     Classification is Class_1;

r_2: IF filter_2 (DocumentFuzzySet is FuzzySetForClass_2) THEN
     Classification is Class_2;

...

r_c: IF filter_c (DocumentFuzzySet is FuzzySetForClass_c) THEN
     Classification is Class_c.

In this embodiment, each class Class_i is represented by a corresponding
rule r_i that consists of just one condition, where DocumentFuzzySet and
FuzzySetForClass_i are fuzzy sets of the form {f_1/μ_1, ..., f_n/μ_n}, and
of a class filter that is a monotonic function having adjustable
parameters/form, where the feature index i ranges from 1 to n and μ_i
denotes the membership value associated with feature i.

By representing each class as described above, various approximate
reasoning strategies can be used to classify an unlabelled document. The
approximate reasoning approach of this embodiment relies on strategies that
have been demonstrated in a variety of fields from control to image
understanding, and various aspects of text processing such as thesauri
construction. Such systems modeling is further described in Soft Computing
by Shanahan. Additionally, approximate reasoning strategies are further
described in "Applied Research in Fuzzy Technology", edited by Ralescu,
Kluwer Academic Publishers, New York, 1995.

In general, the approximate reasoning strategy consists of: (1)
calculating the degree of match between each class fuzzy set,
FuzzySetForClass_i, and the document fuzzy set,
{f_1/μ_DocumentFuzzySet(f_1), ..., f_n/μ_DocumentFuzzySet(f_n)} (as set
forth in detail in section 2.B.1 below); (2) passing the resulting degree of
match through the respective filter function (as set forth in detail in
section 2.B.2 below); and finally (3) selecting a class label to assign to
the unlabelled document based upon the filtered degrees of match (as set
forth in detail in section 2.B.3 below).

The input to such reasoning strategies is a knowledge base and the
document fuzzy set, DocumentFuzzySet, corresponding to an unlabelled
document. The document fuzzy set is constructed as described above using
the bijective transformation process described in section 4 below. This
results in a document fuzzy set DocumentFuzzySet of the form
{f_1/μ_DocumentFuzzySet(f_1), ..., f_n/μ_DocumentFuzzySet(f_n)}. The
knowledge base, as described above, can be learned or predefined.

2.B.1 Matching Strategies 

An approximate reasoning strategy is depicted in a flow diagram in
Figure 13 that identifies the acts performed at 312 in Figure 3 for
assigning a class label to an unlabeled text object using approximate
reasoning to categorize the unlabeled text object.

Initially, a degree of match between each class fuzzy set and the
document fuzzy set is computed at 1302. This involves identifying fuzzy set
matching strategies that are used to measure the degree of match between
each class fuzzy set, FuzzySetForClass_i, and the document fuzzy set,
DocumentFuzzySet.

In one embodiment, a max-min (i.e., maximum-minimum) strategy is
used to calculate a degree of match between a fuzzy set for class "i",
denoted as FuzzySetForClass_i, and an unlabelled document fuzzy set
DocumentFuzzySet as follows:

\[
Match_i = \max_{f}\, \min\left(\mu_{FuzzySetForClass_i}(f),\; \mu_{DocumentFuzzySet}(f)\right)
\]

where f ranges over the text features (1...n) in the document,
μ_DocumentFuzzySet(f) denotes the membership value of f in the fuzzy set of
the Document, and μ_FuzzySetForClass_i(f) denotes the membership value of
feature f in the fuzzy set of the Class_i. This corresponds to the degree of
membership of the unlabelled text object or Document 112 in each class
Class_i and is termed Match_i. Figure 14 graphically depicts an example of
the max-min strategy for the class fuzzy set and the document fuzzy set. In
the example shown, first each text feature "f" is examined and the minimum
value between each class fuzzy set and the document fuzzy set is selected
(e.g., 1401, 1402, 1403, 1404). Subsequently, the maximum of these minimums
is selected (e.g., 1403) to represent the match (or degree of match).
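
For illustration, the max-min strategy might be sketched as follows. Fuzzy
sets are represented here as dictionaries from feature to membership value,
and a feature absent from a fuzzy set is assumed to have membership zero;
the representation, the function name, and the values are assumptions of
this example.

```python
def max_min_match(class_fuzzy_set, document_fuzzy_set):
    """Maximum over features of the minimum of the two membership values."""
    features = set(class_fuzzy_set) | set(document_fuzzy_set)
    minimums = (
        min(class_fuzzy_set.get(f, 0.0), document_fuzzy_set.get(f, 0.0))
        for f in features
    )
    return max(minimums, default=0.0)


class_fs = {"goal": 1.0, "team": 0.6, "vote": 0.1}
document_fs = {"goal": 0.4, "team": 1.0, "ball": 0.7}
match = max_min_match(class_fs, document_fs)
```

The per-feature minimums in this example are 0.4 for "goal", 0.6 for "team",
and 0 for the features that appear in only one of the two fuzzy sets, so the
degree of match is 0.6.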

In an alternate embodiment, a probabilistic matching strategy is used
whereby the degree of match between each class fuzzy set,
FuzzySetForClass_i, and the document fuzzy set,
{f_1/μ_DocumentFuzzySet(f_1), ..., f_n/μ_DocumentFuzzySet(f_n)}, is
calculated as a probabilistic conditionalization (point semantic
unification) as follows:

\[
\Pr(FuzzySetForClass_i \mid DocumentFuzzySet) = \sum_{j=1}^{n} \mu_{FuzzySetForClass_i}(f_j) \times \Pr\nolimits_{DocumentFuzzySet}(f_j)
\]

where each text feature f_j is associated with a class membership value
μ_FuzzySetForClass_i(f_j) in the fuzzy set FuzzySetForClass_i.
Pr_DocumentFuzzySet(f_j) denotes the probability distribution obtained from
the fuzzy set DocumentFuzzySet using the bijective transformation process
outlined in section 4 below (or simply, in this case, the normalized
frequency distribution of the unlabelled document, Document). The value
Pr(FuzzySetForClass_i | DocumentFuzzySet) represents the degree of match
between FuzzySetForClass_i and DocumentFuzzySet and is termed Match_i.
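
The probabilistic matching strategy might be sketched as follows, taking
the normalized frequency distribution of the document as the probability
distribution. The dictionary representation, the function name, and the
values are assumptions of this example.

```python
def probabilistic_match(class_fuzzy_set, document_distribution):
    """Sum over features of class membership times document probability."""
    return sum(
        class_fuzzy_set.get(feature, 0.0) * probability
        for feature, probability in document_distribution.items()
    )


class_fs = {"goal": 1.0, "team": 0.6}
document_pr = {"goal": 0.5, "team": 0.25, "ball": 0.25}
match = probabilistic_match(class_fs, document_pr)
```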

2.B.2 Filtering The Degree Of Match 

Referring again to Figure 13, each computed degree of match is then
filtered at 1304 to define an activation value for its associated class
rule. These computed activation values are subsequently used to perform
decision making at 1306.

More specifically, the result of each class match, Match_i, is passed
through a class-specific filter function. This act can be bypassed by
setting the filter function to the identity function. Applying the filter
function to the degree of match, Match_i, results in an activation value
activation_i for rule r_i. In the instance where the filter is the identity
function, the activation value activation_i for a rule r_i is simply equated
to the corresponding class match, Match_i. The activation value for each
rule is subsequently associated with the output value Class_i for each rule.

Passing the degree of match "x" through the filter S(x) to arrive at an
activation value is graphically depicted in Figure 15 for an example
class-specific filter function S(x). As illustrated in Figure 15, the degree
of match "x" at "c" is passed through filter function S(x) to define a
corresponding activation value "d".

2.B.3 Decision Making

Referring again to Figure 13, decision-making takes place during which
a class label is selected and assigned to the unlabelled document (or text
object 112) at 1306, thereby classifying or categorizing the unlabeled text
object. More specifically at 1306, an unlabeled text object 112 is
classified by assigning to it the class label of each class rule that
satisfies a selected decision making rule.

In one embodiment, the selected decision making rule selects the class
label of the output value Class_Max associated with the highest filtered
activation value identified at 1304. In this embodiment at 1306, the
unlabelled text object or document, Document, is classified using the label
associated with the class, Class_Max.

In another embodiment, the unlabelled document is classified using a 
predefined number of labels. In one instance, "n" labels associated with the "n" 
highest filtered activation values at 1304 are assigned to the unlabeled text 
object 112 at 1306. 

In yet another embodiment, a threshold value T is chosen, such that any label 
associated with a filtered activation value at 1304 that is greater than the 
threshold value T is assigned to the unlabeled text object 112 at 1306.
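
The three decision making rules described above might be sketched as
follows, operating on a dictionary that maps each class label to its
filtered activation value. The function names and the example values are
hypothetical.

```python
def decide_max(activations):
    """Single label with the highest filtered activation value."""
    return max(activations, key=activations.get)


def decide_top_n(activations, n):
    """The n labels with the n highest filtered activation values."""
    return sorted(activations, key=activations.get, reverse=True)[:n]


def decide_threshold(activations, threshold):
    """Every label whose filtered activation value exceeds the threshold T."""
    return [label for label, value in activations.items() if value > threshold]


activations = {"sport": 0.9, "politics": 0.4, "finance": 0.7}
single = decide_max(activations)
top_two = decide_top_n(activations, 2)
above = decide_threshold(activations, 0.5)
```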

2.C Knowledge Base Learning 

Figure 16 illustrates an alternate embodiment in which a categorizer 
1600 includes a learning system 1601. The learning system 1601 uses a 
learning module 1602 to construct the knowledge base 122 of the fuzzy text 
classifier 110 shown in Figure 1. 

In the embodiment shown in Figure 3, acts 300 are performed to
estimate class rules (i.e., class fuzzy sets and class filters) from a
training database or corpus 1604 "Train" (i.e., a labeled collection of
documents) for each possible document classification Class. The accuracy of
the results can be tested against documents in a validation database 1606.

More formally, a training database 1604 "Train" is a collection of
labeled documents consisting of tuples <D_i, L_i>, where D_i denotes the
document and L_i denotes the class label associated with D_i. Training
consists of the following actions: feature extraction (as described above),
feature reduction (as described above), and class rule estimation (i.e.,
class fuzzy sets and class filters). Having extracted and reduced the
features for all classes, learning of class fuzzy sets and class filters can
begin. A class fuzzy set, FuzzySetForClass_i, can be denoted as follows:

\[
FuzzySetForClass_i = \sum_{j=1}^{n} f_j \,/\, \mu_{FuzzySetForClass_i}(f_j)
\]

where each text feature f_j is associated with a class membership value
μ_FuzzySetForClass_i(f_j).

As illustrated in the flow diagram in Figure 17, the construction of a
class document fuzzy set, FuzzySetForClass_i, for Class_i involves merging,
at 1720, all documents defined in a set of training documents for a class
obtained, at 1710, that have a label Class_i (e.g., tuples that have a label
L = Class_i). Subsequently, acts similar to those performed for constructing
a document fuzzy set in Figure 4 are performed at 1730 to construct a class
fuzzy set using the merged class document. Finally, if the last set of class
training documents has been processed at 1740, then the routine terminates;
otherwise, it continues at 1710.

More specifically, the acts in Figure 4 that are also performed at 1730
for the merged class document involve: performing feature extraction on the
merged class document (at 402); filtering the extracted features to reduce
the feature set if necessary (at 404); identifying the frequency of
occurrence of each remaining feature (at 406); and generating a normalized
frequency vector for this class, Class_i (at 408). This normalized frequency
distribution is subsequently converted into the class document fuzzy set,
FuzzySetForClass_i, using the bijective transformation process outlined in
section 4 (at 410).
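
The merging and normalization acts might be sketched as follows. The sketch
assumes that feature extraction and filtering have already reduced each
training document to a list of features, and it stops at the normalized
frequency distribution; the conversion to a fuzzy set described in section 4
is deliberately left out. All names and data are illustrative.

```python
from collections import Counter


def class_frequency_distribution(training_set, class_label):
    """Merge all documents labeled class_label and normalize feature counts.

    training_set is a list of (features, label) tuples, where features is
    the list of text features already extracted from one document.
    """
    merged = []
    for features, label in training_set:
        if label == class_label:
            merged.extend(features)  # merge into one class document
    counts = Counter(merged)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: count / total for feature, count in counts.items()}


train = [
    (["goal", "team"], "sport"),
    (["goal"], "sport"),
    (["vote"], "politics"),
]
sport_distribution = class_frequency_distribution(train, "sport")
```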

2.C.1 Filter Identification 



Having identified the class fuzzy sets, estimation of the class rule
filters can be carried out. This section describes one method for
determining the class filter function from the data. Assuming the class rule
filter takes the following functional form:

\[
S(x) : [0,1] \rightarrow [0,1],
\]

then the class rule filter structure in this embodiment is limited to a
piecewise linear function with two degrees of freedom and is canonically
defined as follows and as shown in Figure 18:

\[
S(x) =
\begin{cases}
0 & x \le a \\
\dfrac{x-a}{b-a} & a < x < b \\
1 & \text{otherwise}
\end{cases}
\]

where the function S(x) is constrained such that 0 ≤ a ≤ b ≤ 1. More
generally, two or more degrees of freedom could be used.

In an alternate embodiment, the structure of each class filter could be
left free and the filter structure and parameters could be determined
automatically. For example, a parametric monotonic function such as a
sigmoid could be used as a filter. The default filter can be represented as
the true fuzzy set, that is, the identity function S(x)=x. The number of
degrees of freedom in a filter is restricted to two in order to increase the
transparency of the model and also to reduce computational complexity in
determining the filter. Alternatively, the structure of a filter function
can be determined using a genetic algorithm (or any other gradient-free
optimization technique), where the length of the chromosome could be
variable.

In one embodiment, a method determines a filter with two degrees of
freedom for each class rule. Each filter is viewed as a piecewise linear
function with two degrees of freedom that ultimately determine the shape of
the filter function. This is depicted in the parameterized class filter
shown in Figure 18. Varying a and b yields a large and flexible range of
filters while maintaining filter transparency. In the example shown in
Figure 18, a and b are subject to the following constraints:

\[
0 \le a \le b \le 1.
\]
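
A two-parameter piecewise linear filter of this kind might be sketched as
follows. The sketch assumes that the constraint on a and b is non-strict,
so that the identity filter (a = 0, b = 1) is admissible; the function name
is hypothetical.

```python
def piecewise_linear_filter(x, a, b):
    """Map a degree of match x in [0, 1] to an activation value in [0, 1]."""
    if not 0.0 <= a <= b <= 1.0:
        raise ValueError("filter points must satisfy 0 <= a <= b <= 1")
    if x <= a:
        return 0.0
    if x < b:
        return (x - a) / (b - a)  # only reached when a < b, so no division by zero
    return 1.0


identity_activation = piecewise_linear_filter(0.3, 0.0, 1.0)
sharpened_activation = piecewise_linear_filter(0.5, 0.25, 0.75)
```

With a = 0 and b = 1 the filter leaves the degree of match unchanged, which
corresponds to bypassing the filtering act.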

The determination of class filters can be formulated as an optimization
problem where the goal is to determine the set of class filters for that
model. Any of a number of optimization techniques can be used to determine
the filter structures. Since the filter functions are piecewise linear in
nature, and their derivatives are therefore discontinuous, the optimization
techniques need to be derivative free. Consequently, a direction-set
approach based upon Powell's minimization algorithm is used. Powell's
minimization algorithm is disclosed by Powell, "An efficient method for
finding the minimum of a function of several variables without calculating
derivatives", Comput. J., Vol. 7, 1964, pp. 155-162 (hereinafter referred to
as "Powell's minimization algorithm"), which is incorporated herein by
reference.

2.C.1.a Powell's Minimization Algorithm

More specifically, Powell's minimization algorithm is an iterative
approach that carries out function minimizations along favorable directions
u_i in N-dimensional space. The technique involves minimizing along the set
of directions u_i, for i ∈ [1, N]. It begins by randomly initializing each
direction u_i, subject to constraints if they should exist, and then
repeating the following steps (a)-(e) until the function stops decreasing:
(a) save the starting position as P_0; (b) for i ∈ [1, N] move P_{i-1} to
the minimum along direction u_i (using, for example, a golden section
search) and call this point P_i; (c) for i ∈ [1, N-1] set u_i to u_{i+1};
(d) set u_N to P_N - P_0; and (e) move P_N to the minimum along direction
u_N and call this point P_0.
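
Steps (a)-(e) might be sketched as follows for an unconstrained function.
This is a simplified illustration rather than a complete implementation: it
starts from the unit directions, searches each line over a fixed interval
(the span parameter, an assumption of this sketch) with a golden section
search, and ignores constraints. All function names are hypothetical.

```python
import math


def golden_section(g, lo, hi, tol=1e-8):
    """Minimize a unimodal one-dimensional function g on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if g(c) < g(d):
            hi = d  # the minimum lies in [lo, d]
        else:
            lo = c  # the minimum lies in [c, hi]
    return (lo + hi) / 2.0


def line_minimum(func, point, direction, span=10.0):
    """Move point to the minimum of func along direction (step in [-span, span])."""
    def along(t):
        return func([p + t * u for p, u in zip(point, direction)])
    t = golden_section(along, -span, span)
    return [p + t * u for p, u in zip(point, direction)]


def powell_minimize(func, start, max_iterations=20, tol=1e-12):
    n = len(start)
    directions = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    p0 = list(start)
    best = func(p0)
    for _ in range(max_iterations):
        p = p0                                          # (a) save start as P_0
        for u in directions:                            # (b) minimize along each u_i
            p = line_minimum(func, p, u)
        new_direction = [a - b for a, b in zip(p, p0)]  # P_N - P_0
        directions = directions[1:] + [new_direction]   # (c) shift, (d) append
        p0 = line_minimum(func, p, new_direction)       # (e) new P_0
        value = func(p0)
        if best - value <= tol:                         # function stopped decreasing
            break
        best = value
    return p0


def bowl(v):
    return (v[0] - 1.0) ** 2 + (v[1] - 2.0) ** 2 + 0.5 * (v[0] - 1.0) * (v[1] - 2.0)


minimum = powell_minimize(bowl, [0.0, 0.0])
```

The example function is a convex quadratic whose minimum lies at (1, 2).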

Using Powell's minimization algorithm, the minimization problem for
determining class filters is defined as follows: each filter degree of
freedom (the a_i and b_i filter points) is viewed as a variable (directions
u_i) in the range [0, 1] that satisfies the following constraint:

\[
0 \le a_i \le b_i \le 1.
\]

The initial filters are set to the identity filter. Then a constrained
Powell's direction set minimization is carried out for p iterations
(empirical evidence suggests a range of [1, 10]) or until the function stops
decreasing. Each iteration involves N direction sets, corresponding to the
number of filter variables (where N = C × 2, C is the number of text
classes, and 2 corresponds to the two degrees of freedom that each class
filter possesses), where the initial directions are set to the unit
directions. In order to evaluate the cost function for a set of filters, the
corresponding fuzzy text classifier is constructed and evaluated on the
validation database 1606 (e.g., a subset of the training database that is
not used for training). Following Powell's minimization, the values
associated with each of the variables, whose corresponding model yielded the
lowest error on the validation database, are taken as the result of the
optimization and are used to generate the respective class rule filters.

Finally, a class fuzzy set knowledge base 122 (shown in Figure 16) is
generated at 306 (shown in Figure 3) using the learned class fuzzy sets at
302 and class filters at 304.

3. Granule Feature Based Models

This section sets forth a second embodiment of the categorizer 110 or
1600 shown in Figures 1 and 16. Figure 19 depicts a flow diagram of the
processes performed by a categorizer that uses a granule feature model
instead of the class fuzzy model described above. Similar to a categorizer
described above using fuzzy set models, a categorizer using granule feature
based models performs a functional mapping from the set of features (e.g.,
text features), {f_1, ..., f_n}, to a set of class values {c_1, ..., c_c}.

Generally, a granule feature text categorizer consists of one class rule
for each class C_i and has a knowledge base of the following form:

r_1: IF filter_1 (
         DocumentFuzzySet_f1 is FuzzySetForClass_1-f1   w_1-f1
         ...
         DocumentFuzzySet_fn is FuzzySetForClass_1-fn   w_1-fn
     ) THEN Classification is Class_1

...

r_c: IF filter_c (
         DocumentFuzzySet_f1 is FuzzySetForClass_c-f1   w_c-f1
         ...
         DocumentFuzzySet_fn is FuzzySetForClass_c-fn   w_c-fn
     ) THEN Classification is Class_c.

In this embodiment, each class Class_i is represented by a
corresponding rule r_i that consists of n weighted conditions (e.g., one for
each text feature) and a class filter. In this case, each feature is a
granule feature, where each of the feature values DocumentFuzzySet_fj and
FuzzySetForClass_i-fj are granule fuzzy sets of the form
{w_1/μ_DocumentFuzzySet_fj(w_1), ..., w_m/μ_DocumentFuzzySet_fj(w_m)} and
{w_1/μ_FuzzySetForClass_i-fj(w_1), ..., w_m/μ_FuzzySetForClass_i-fj(w_m)}
respectively, the details of which are discussed below. The weight
associated with each feature can be intuitively regarded as indicating the
importance of that feature in the definition of a class.

Referring again to Figure 19, if the embodiment is similar to that shown
in Figure 16, the knowledge base 122 for the granule feature based model is
learned at 1900; otherwise, the embodiment is similar to that shown in
Figure 1 where the knowledge base is predefined, not requiring actions 1900
to be performed. The details of learning at 1900 are discussed below in
section 3.C.

At 1908, a new document is accepted by the categorizer. For each new
document received by the categorizer, document granule feature fuzzy sets
are developed at 1910 and class labels are assigned to text objects using
approximate reasoning to categorize the document at 1912, the details of
which are discussed below in sections 3.A and 3.B.

3.A Constructing Document Feature Fuzzy Sets 

Figure 20 depicts a flow diagram for constructing a document feature
fuzzy set for a granule feature based model as set forth at 1910 in Figure
19. The document's features are extracted at 2002 and filtered (if
necessary) at 2004. At 2006, a next feature is identified in the features
remaining after being filtered at 2004. Subsequently at 2008, a granule
feature fuzzy set is constructed that corresponds to the value of the
identified feature as described above in section 2.A.1.c. This process
terminates once the last feature is processed at 2010, thereby defining the
document feature fuzzy set to include each of the granule feature fuzzy sets
constructed at 2008.

3.B Approximate Reasoning Using Granule Feature Based Models 

Figure 21 depicts a flow diagram that details the act 1912 in Figure 19
for assigning a class label to a text object using approximate reasoning to
categorize the document. At 2102, the degree of match is computed between
each class granule feature fuzzy set, FuzzySetForClass_i-fj, and the
document granule feature fuzzy set, DocumentFuzzySet_fj, using any of the
matching strategies outlined above in section 2.B.1. This leads to a degree
of match match_i-fj for each granule feature f_j in each rule r_i.

Subsequently at 2104, the individual degrees of match are aggregated to
define an overall degree of match using either an additive model or a
product model.

When an additive model is used to perform aggregation at 2104, the
overall degree of match is set equal to the weighted sum of the degrees of
match. More specifically in this embodiment, for each rule r_i a weighted
combination of the constituent feature matches, match_i-fj, is taken as
follows, yielding an overall degree of match for rule r_i of match_i:

\[
match_i = \sum_{j=1}^{n} w_{i\text{-}f_j} \, match_{i\text{-}f_j} .
\]

When a product model is used to perform aggregation at 2104, the
overall degree of match is set equal to the product of the degrees of match.
More specifically in this embodiment, the overall degree of match match_i
for rule r_i is calculated by taking the product of the constituent feature
matches, match_i-fj, as follows:

\[
match_i = \prod_{j=1}^{n} match_{i\text{-}f_j} .
\]
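
Both aggregation models might be sketched as follows for a single rule,
with the per-feature degrees of match and weights held in dictionaries
keyed by feature. The names and values are illustrative only.

```python
import math


def additive_match(feature_matches, weights):
    """Weighted sum of the per-feature degrees of match for one rule."""
    return sum(weights[f] * m for f, m in feature_matches.items())


def product_match(feature_matches):
    """Product of the per-feature degrees of match for one rule."""
    return math.prod(feature_matches.values())


feature_matches = {"f1": 0.5, "f2": 1.0}
weights = {"f1": 0.25, "f2": 0.75}
overall_additive = additive_match(feature_matches, weights)
overall_product = product_match(feature_matches)
```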

At 2106, the computed match, match_i, for each rule r_i is then passed
through a corresponding filter function filter_i, resulting in an activation
value activation_i for the rule. The activation value for each rule that
satisfies a selected decision making rule at 2108 is subsequently associated
with the output value Class_i (i.e., class label) for each rule.

At 2108, decision-making takes place in which a class label is selected
to assign to the unlabelled document (or text object). More specifically at
2108, a text object is classified by assigning to it the class label of each
class rule that satisfies a selected decision making rule.

In one embodiment, the decision making rule reduces to selecting a
class label of the output value Class_Max associated with the highest
activation value. That is, in this embodiment the unlabelled document,
Document, is classified using the label associated with Class_Max.

In another embodiment, the unlabelled document is classified using a
decision making rule that specifies a predefined number of labels. In one
instance of this embodiment, the "n" labels associated with the "n" highest
activation values are assigned to the unlabeled text object.

In yet another embodiment, a threshold value T could be chosen, such
that any label associated with an activation value greater than the
threshold value T is assigned to the unlabeled text object.

3.C Learning Granule Feature Based Models 

Referring again to Figure 19, a process for the construction of a
granule feature text classifier is presented at 1900. At 1902, a granule
fuzzy set is estimated for each granule feature from a training database or
corpus 1604 Train (i.e., a labeled collection of documents) for each
possible document classification Class. At 1904, granule feature weights are
estimated in the case where the granule features are aggregated as a
weighted function (i.e., for additive models only). At 1906, class filters
are estimated. Finally at 1907, a granule feature knowledge base is
generated using the granule feature fuzzy sets estimated at 1902, the
granule feature weights (if additive model) estimated at 1904, and the class
filters estimated at 1906.

Figure 22 depicts a flow diagram for a process that is used to estimate
each granule fuzzy set at 1902 shown in Figure 19. Initially a training set
is defined for documents in a class at 2202. More formally, a training
database Train is a collection of labeled documents consisting of tuples
<D_i, C_i>, where D_i denotes the document and C_i denotes the class label
associated with D_i.

At 2204, granule feature fuzzy sets are extracted from each text object
D_i, resulting in a feature vector {f_i-1, ..., f_i-n}, where each f_i-j is
a granule fuzzy set value of the j-th feature of training example i. At
2206, the extracted granule features are filtered, if necessary, to limit
the number of features in the feature vector.

In this case, the granule feature fuzzy set has the following form:
{w_j-1/μ_wj-1(f_j), ..., w_j-m/μ_wj-m(f_j)}, where w_j-1, ..., w_j-m
represents a linguistic partition of the feature f_j and μ_wj-p(f_j)
represents the membership of the word w_j-p in the granule fuzzy set, which
in this case corresponds to the membership of the normalized frequency value
for feature f_j in the fuzzy set denoted by w_j-p.

For simplification of presentation but without loss of generality, each
feature vector is assumed to consist of just one feature f. As a result of
this simplification, each document D_i is now represented in terms of one
granule fuzzy set of the form {w_1/μ_w1(f), ..., w_m/μ_wm(f)}, where
w_1, ..., w_m represents a linguistic partition of the feature f and
μ_wp(f) represents the membership of the frequency value for feature f in
the granule denoted by the fuzzy set w_p.

Learning a granule feature representation of classes consists of a 
number of steps that are subsequently presented and graphically depicted in 
Figure 22. At 2208, for each class label $Class_k$, a granule frequency 
distribution is initialized; i.e., $\{w^k_1/0, \ldots, w^k_m/0\}$. At 2210, a 
next document in the training set is examined with class label $Class_k$. 

In the case of each training example $D_i$, each feature frequency value 
(normalized or otherwise), in this instance limited to one feature $f$, is 
linguistically reinterpreted using the granules that partition the frequency 
universe of $f$; i.e., a granule feature fuzzy set is generated (described 
here using one granule feature fuzzy set): 
$\langle \{w_1/\mu_{w_1}(f), \ldots, w_m/\mu_{w_m}(f)\}, C_k \rangle$. The 
resulting granule feature fuzzy set 
$\{w_1/\mu_{w_1}(f), \ldots, w_m/\mu_{w_m}(f)\}$ is then converted into a 
probability distribution on granules 
$\{w_1/\Pr_{w_1}(f), \ldots, w_m/\Pr_{w_m}(f)\}$ using the bijective 
transformation outlined in section 4, at 2212. 
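
As a minimal sketch of the conversion at 2212, the following Python function 
turns a granule fuzzy set into a probability distribution on granules by way 
of the mass assignment described in section 4. It assumes a normal fuzzy set 
(maximum membership of one) and a uniform prior. The function name and the 
dictionary representation are illustrative and do not come from the 
specification. 

```python
def fuzzy_to_probability(fuzzy_set):
    """Convert a normal fuzzy set {word: membership} to {word: probability}.

    Assumption: the prior is uniform, so the mass of each focal element is
    shared equally among the words it contains.
    """
    # Rank words by decreasing membership; the i-th focal element is the set
    # of the first i words and carries mass mu_i - mu_(i+1).
    ranked = sorted(fuzzy_set.items(), key=lambda kv: kv[1], reverse=True)
    memberships = [mu for _, mu in ranked] + [0.0]
    probabilities = {}
    share = 0.0
    # Walk from the least to the most possible word, accumulating the shares
    # of every focal element that contains the current word.
    for i in range(len(ranked), 0, -1):
        mass = memberships[i - 1] - memberships[i]
        share += mass / i
        probabilities[ranked[i - 1][0]] = share
    return probabilities


# Fuzzy set taken from the worked example in section 4.
example = fuzzy_to_probability({"Large": 1.0, "Medium": 0.7, "Small": 0.1})
# approximately Large 0.633, Medium 0.333, Small 0.033
```

Applied to the fuzzy set of the example in section 4, the sketch recovers the 
frequency distribution given there, up to rounding. 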

At 2214, each training example 
$\langle \{w_1/\Pr_{w_1}(f), \ldots, w_m/\Pr_{w_m}(f)\}, C_k \rangle$ is 
processed as follows: the granule frequency distribution 
$\{w_1/freq_k(w_1), \ldots, w_m/freq_k(w_m)\}$ associated with the class $C_k$ 
is updated with $\{w_1/\Pr_{w_1}(f), \ldots, w_m/\Pr_{w_m}(f)\}$; i.e., the 
class granule frequency $freq_k(w_p)$ associated with each granule $w_p$ is 
updated as follows: $freq_k(w_p) = freq_k(w_p) + \Pr_{w_p}(f)$. 

Once all training examples have been processed at 2216, the class 
granule frequency distributions 
$\{w_1/\Pr_{C_k}(w_1), \ldots, w_m/\Pr_{C_k}(w_m)\}$ are converted, at 2218, 
into granule fuzzy sets using the bijective transformation outlined in 
section 4. At 2220, if an additional granule feature fuzzy set is to be 
learned then act 2202 is repeated; otherwise, the process terminates. 
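
The accumulation at 2208 through 2216 can be sketched in Python as follows. 
The function name, the input layout (a list of probability distribution and 
class label pairs) and the class labels are hypothetical. Dividing the 
accumulated frequencies by their total is an assumption about how the class 
frequencies become the distribution $\Pr_{C_k}$; the final conversion to 
fuzzy sets at 2218 is left to the transformation of section 4. 

```python
from collections import defaultdict


def learn_class_distributions(training_examples, granules):
    """Accumulate per-class granule frequencies and normalize them.

    training_examples: iterable of (prob_dist, class_label), where prob_dist
    maps granule words to probabilities (the output of act 2212).
    """
    # 2208: one zero-initialized frequency distribution per class label.
    freq = defaultdict(lambda: {w: 0.0 for w in granules})
    # 2210-2216: freq_k(w_p) = freq_k(w_p) + Pr_wp(f) for every example.
    for prob_dist, label in training_examples:
        for w in granules:
            freq[label][w] += prob_dist.get(w, 0.0)
    # Assumed normalization so that each class distribution sums to one.
    distributions = {}
    for label, counts in freq.items():
        total = sum(counts.values())
        distributions[label] = {
            w: (c / total if total else 0.0) for w, c in counts.items()
        }
    return distributions


granules = ["Small", "Medium", "Large"]
examples = [
    ({"Small": 0.5, "Medium": 0.5}, "sports"),
    ({"Medium": 0.25, "Large": 0.75}, "sports"),
    ({"Small": 1.0}, "politics"),
]
class_dists = learn_class_distributions(examples, granules)
```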

3.C.1 Estimating The Granule Feature Weights With Additive Model 

In one embodiment, the weights associated with each granule feature 
can be determined using semantic discrimination analysis. Semantic 
discrimination analysis measures the degree of mutual dissimilarity of classes, 
represented by fuzzy sets defined over the same feature universe. This 
measure is based on point semantic unification. The semantic discriminating 
power of a granule feature $f_i$ for a class Class (i.e., 
$\mathrm{Discrim\_f}_{i,Class}$) is calculated as follows: 

\[ \mathrm{Discrim\_f}_{i,Class} = 1 - \max_{j \neq Class} \Pr(FS_{Class} \mid FS_j), \]

where $\Pr(FS_k \mid FS_j) = \sum_{x} \mu_{FS_k}(x) \times \Pr_{FS_j}(x)$ 
denotes the semantic match between fuzzy sets. This yields a vector of 
discrimination values (consisting of one discrimination value for each 
feature) for each rule. The weight associated with each granule feature $f_i$ 
is obtained by normalizing each value such that the resulting weights for a 
rule sum to one. This is achieved as follows: 

\[ w_{i,Class} = \frac{\mathrm{Discrim\_f}_{i,Class}}{\sum_{j=1}^{n} \mathrm{Discrim\_f}_{j,Class}} \]


where n corresponds to the number of features in the class rule (this can vary 
from class to class, see section 3.B.2 for details). This result can be further 
optimized by eliminating features with a learned weight value $w_{i,Class}$ 
that is less than a threshold weight T. Alternatively, this result can be 
further optimized by selecting only the n highest-weighted features for a 
class. 
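
A minimal Python sketch of this weighting follows. It assumes that each class 
supplies, per feature, both its fuzzy set and the probability distribution 
derived from it, and that the maximum is taken over the other classes. All 
function names and the toy data are hypothetical. 

```python
def semantic_match(fuzzy_k, prob_j):
    """Pr(FS_k | FS_j): sum over x of mu_FS_k(x) * Pr_FS_j(x)."""
    return sum(mu * prob_j.get(x, 0.0) for x, mu in fuzzy_k.items())


def discrimination(class_sets, target):
    """One minus the best match of the target class against other classes.

    class_sets maps class label -> (fuzzy_set, prob_dist) for ONE feature.
    """
    fuzzy_target, _ = class_sets[target]
    best_match = max(
        semantic_match(fuzzy_target, prob)
        for label, (_, prob) in class_sets.items()
        if label != target
    )
    return 1.0 - best_match


def feature_weights(discrim_by_feature):
    """Normalize discrimination values so the weights of a rule sum to one."""
    total = sum(discrim_by_feature.values())
    return {f: d / total for f, d in discrim_by_feature.items()}


low = ({"Small": 1.0, "Medium": 0.5, "Large": 0.0},
       {"Small": 0.75, "Medium": 0.25, "Large": 0.0})
high = ({"Small": 0.0, "Medium": 0.5, "Large": 1.0},
        {"Small": 0.0, "Medium": 0.25, "Large": 0.75})

# Feature f1 separates classes A and B; feature f2 does not.
f1 = {"A": low, "B": high}
f2 = {"A": low, "B": low}
discrim = {"f1": discrimination(f1, "A"), "f2": discrimination(f2, "A")}
weights = feature_weights(discrim)
```

In this toy setting the feature whose class fuzzy sets overlap least receives 
the larger weight, which is the behavior the normalization is meant to 
produce. 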

3.C.2 Feature Weights Identification Using Powell's Minimization 
Algorithm 

In an alternative embodiment, the determination of feature weights can be 
formulated as an optimization problem where the goal is to determine the set 
of weights that model the aggregation behavior of a problem effectively. Any 
of a number of optimization techniques can be used to determine the weights. 

In one embodiment, a direction-set method approach based upon 
Powell's minimization algorithm is selected. Any other optimization technique 
could equally have been used. For example, the weights could be encoded in 
a chromosome structure and a genetic search carried out to determine near 
optimum weights as described by Holland in "Adaptation in Natural and 
Artificial Systems", University of Michigan Press: Ann Arbor, 1975. The cost or 
fitness function is based on the model error as determined on a validation 
database. Error measures such as recall and precision, or a combination of 
these measures, such as F1, can be used. More details relating to error 
measures (i.e., information retrieval measures) are described by Manning and 
Schütze in "Foundations Of Statistical Natural Language Processing", 
published in 1999, MIT Press, Cambridge, MA. 
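
For illustration, the standard definitions of precision, recall and F1 for a 
single class can be written in Python as below. The function name and the 
label lists are hypothetical; a cost for the optimizer could then be taken 
as, for example, one minus F1, which is an assumption rather than something 
the specification fixes. 

```python
def precision_recall_f1(predicted, actual, target):
    """Precision, recall and F1 of the class `target` on paired label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == target and a == target)
    fp = sum(1 for p, a in pairs if p == target and a != target)
    fn = sum(1 for p, a in pairs if p != target and a == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = ["A", "A", "B", "A"]
actual = ["A", "B", "B", "A"]
precision, recall, f1 = precision_recall_f1(predicted, actual, "A")
```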

The weights identification problem is encoded as follows: each class 
rule weight $w_i$ is viewed as a variable that satisfies the following 
constraint: 

\[ 0 \le w_i \le 1. \]

The approach begins with estimating the weights by measuring the 
semantic separation of the inter class fuzzy sets using semantic discrimination 
analysis (see section 3.C.1). Then a constrained Powell's direction set 
minimization, as described above, is carried out for p iterations or until the 
function stops decreasing. 

Each iteration involves N direction sets (where $N = R \times W_i$, R 
corresponds to the number of rules in the knowledge base, and $W_i$ denotes 
the number of feature weights in the body of rule $R_i$), where the initial 
directions are set to the unit directions. Note that in this case it is 
assumed that each class rule has an equal number of weights W; however, this 
can vary for each class. 

In order to evaluate the cost function for a set of weights, the 
corresponding additive Cartesian granule feature model is constructed. The 
class rule weights are set to the normalized Powell variable values (i.e., the 
constituent weights for a class rule are normalized so that the weights for a 
rule sum to one). 

The constructed model is then evaluated on the validation database. In 
this case, the class filters are set to the identity function. Following Powell 
minimization, the weight values whose corresponding model yielded the 
lowest error are taken to be the result of the optimization. 
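
The loop around the cost function can be sketched as follows. Note that the 
sketch swaps in a much simpler compass (coordinate) search along the unit 
directions instead of Powell's direction-set method; it is meant only to show 
the box constraint on each weight, the normalization before each cost 
evaluation, and the rule of keeping the lowest-error weights. The toy 
quadratic cost is a hypothetical stand-in for the model error measured on a 
validation database, and all names are illustrative. 

```python
def normalize(weights):
    """Scale the weights of one rule so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def compass_search(cost, start, step=0.25, max_iters=50, tol=1e-12):
    """Minimize cost(normalize(w)) with every w_i kept inside [0, 1]."""
    w = list(start)
    best = cost(normalize(w))
    for _ in range(max_iters):              # at most p iterations
        improved = False
        for i in range(len(w)):             # one unit direction per weight
            for delta in (step, -step):
                trial = list(w)
                trial[i] = min(1.0, max(0.0, trial[i] + delta))
                if sum(trial) == 0.0:       # cannot normalize all-zero weights
                    continue
                error = cost(normalize(trial))
                if error < best - tol:      # keep the lowest-error weights
                    w, best, improved = trial, error, True
        if not improved:
            step /= 2.0                     # refine once the cost stops falling
            if step < 1e-4:
                break
    return normalize(w), best


def toy_cost(weights):
    # Hypothetical stand-in for validation error; its minimum is at (0.7, 0.3).
    return (weights[0] - 0.7) ** 2 + (weights[1] - 0.3) ** 2


found, error = compass_search(toy_cost, [0.5, 0.5])
```

In practice the starting point would be the weights obtained from semantic 
discrimination analysis rather than the equal weights used here. 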

Rule filters can subsequently be estimated using procedures similar to 
those described in section 2.C.1. Subsequently, a granule feature knowledge 
base is generated. 

4. Converting A Probability Distribution To A Fuzzy Set 

This section describes a method for converting a probability distribution 
to a fuzzy set using the bijective transformation. More specifically, this section 
describes a transformation from a normalized frequency probability distribution 
Pr to a corresponding fuzzy set f for a discrete variable. A more detailed 
presentation of this transformation is described in Soft Computing by 
Shanahan. 

Let the fuzzy set f and the frequency probability distribution Pr be 
defined over the frame of discernment $\Omega_W = \{w_1, \ldots, w_n\}$ with 
supports equal to $\Omega_W$ and let P(W) be the power set of $\Omega_W$. To 
simplify the presentation, it is assumed that no two probabilities in Pr are 
equal and that the prior probability distribution on fuzzy partition labels is 
uniform. This bijective transformation process is summarized by the following 
five steps: 

First, order the normalized frequency probability distribution Pr such that: 

\[ \Pr(w_i) > \Pr(w_j) \ \text{if}\ i < j \quad \forall\, i, j \in \{1, \ldots, n\}. \]

Second, since this bi-directional transformation is order preserving, the fuzzy 
set f can assume the following form: 

\[ \mu_f(w_i) > \mu_f(w_j) \ \text{if}\ i < j \quad \forall\, i, j \in \{1, \ldots, n\}. \]

Third, this fuzzy set f induces a possibility distribution $\pi_f$, which in 
turn induces a mass assignment of the form: 

\[ \langle \{w_1, \ldots, w_n\} : \pi_f(w_n), \ \ldots, \ \{w_1, \ldots, w_k\} : \pi_f(w_k) - \pi_f(w_{k+1}), \ \ldots, \ \{w_1\} : \pi_f(w_1) - \pi_f(w_2) \rangle. \]

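Fourth, letting $A_i = \{w_1, \ldots, w_i\}$ $\forall\, i \in \{1, \ldots, n\}$ 
and since $MA_f(A_i) = \pi_f(w_i) - \pi_f(w_{i+1})$ (with 
$\pi_f(w_{n+1}) = 0$), the following equation can be defined: 

\[ \Pr{}'(w_i) = \sum_{k=i}^{n} \frac{MA_f(A_k)}{|A_k|}, \]

which can be simplified to: 

\[ \Pr{}'(w_i) = \sum_{k=i}^{n} \bigl(\pi_f(w_k) - \pi_f(w_{k+1})\bigr) \frac{1}{k}. \]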

Fifth, for $i = n$: 

\[ \Pr{}'(w_n) = \pi_f(w_n) / n, \]

such that: 

\[ \pi_f(w_n) = n \Pr{}'(w_n). \]

The remaining values for $\pi_f(w_i)$ (i.e., $i \in \{1, \ldots, n-1\}$) can 
be solved for by direct substitution of $\pi_f(w_{i+1})$. This leads to the 
following general equation for obtaining a possibility $\pi_f(w_i)$ 
corresponding to the probability $\Pr{}'(w_i)$: 

\[ \pi_f(w_i) = i \Pr{}'(w_i) + \sum_{k=i+1}^{n} \Pr{}'(w_k) \quad \forall\, i \in \{1, \ldots, n\}. \]

The preceding five steps result in a possibility distribution $\pi_f$ and a 
corresponding fuzzy set f. 
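
The general equation above can be sketched in Python as follows, under the 
same assumptions as the text (a uniform prior and a discrete variable). The 
function name and the dictionary representation are illustrative only. 

```python
def probability_to_fuzzy(prob_dist):
    """Map {word: probability} to {word: membership}.

    Uses pi(w_i) = i * Pr(w_i) + sum of Pr(w_k) for k > i, where the words
    are ranked so that w_1 has the highest probability.
    """
    ranked = sorted(prob_dist.items(), key=lambda kv: kv[1], reverse=True)
    fuzzy = {}
    tail = 0.0  # sum of the probabilities ranked below the current word
    for i in range(len(ranked), 0, -1):
        word, p = ranked[i - 1]
        fuzzy[word] = i * p + tail
        tail += p
    return fuzzy


# Frequency distribution on the words Small, Medium and Large.
fuzzy = probability_to_fuzzy({"Large": 0.6333, "Medium": 0.333, "Small": 0.0333})
# approximately Large 1.0, Medium 0.7, Small 0.1
```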

By way of example, consider the following frequency distribution on the 
words Small, Medium and Large: 

Large: 0.6333 + Medium: 0.333 + Small: 0.0333. 

This probability distribution is taken to be the result of conditioning a 
prior probability distribution on an underlying fuzzy set f, that is, 
$\Pr(X \mid f)$. Assuming the prior distribution is uniform, it leads to the 
following fuzzy set: 

f = {Large: 1 + Medium: 0.7 + Small: 0.1} 

where $\mu_f(w_i)$ is calculated through its associated possibility as shown 
in Table 1 (in which * denotes product). 

Table 1 

Word Membership                      Membership Value 
$\mu_f(Large) = \pi_f(Large)$        = 1*0.6333 + (0.333 + 0.0333) = 1 
$\mu_f(Medium) = \pi_f(Medium)$      = 2*0.333 + (0.0333) = 0.7 
$\mu_f(Small) = \pi_f(Small)$        = 3*0.0333 = 0.1 


5. Miscellaneous 

It will be appreciated by those skilled in the art that the categorizers 
described herein can be embodied using software components and hardware 
components that operate on computer systems such as: a personal computer, 
a workstation, a mobile/cellular phone, a handheld device, etc. 

The hardware components include a Central Processing Unit (CPU), 
Random Access Memory (RAM), Read Only Memory (ROM), User 
Input/Output ("I/O"), and network I/O. The User I/O may be coupled to various 
input and output devices, such as a keyboard, a cursor control device (e.g., 
pointing stick, mouse, etc.), a display, a floppy disk, a disk drive, an image 
capture device (e.g., scanner, camera), etc. 

RAM is used by the CPU as a memory buffer to store data. The display is 
an output device that displays data provided by the CPU or other components 
in a computer system. In one embodiment, the display is a raster device. 
Alternatively, the display may be a CRT or LCD. Furthermore, user I/O may be 
coupled to a floppy disk and/or a hard disk drive to store data. Other storage 
devices such as nonvolatile memory (e.g., flash memory), PC-data cards, or the 
like, can also be used to store data used by the computer system. The network 
I/O provides a communications gateway to a network such as a LAN, WAN, or the 
Internet. The network I/O is used to send and receive data over a network 
connected to one or more computer systems or peripheral devices. 

The software components include operating system software, 
application program(s), and any number of elements of the categorizers. It 
should be noted that not all software components are required for all the 
described embodiments. The operating system software may represent 
MS-DOS, the Macintosh OS, OS/2, WINDOWS®, WINDOWS® NT, Unix 
operating systems, the Palm operating system, or other known operating 
systems. Application program(s) may represent one or more application 
programs such as word processing programs, spreadsheet programs, presentation 
programs, auto-completion programs, and editors for graphics and other types 
of multimedia such as images, video, audio, etc. 

The categorizer 110 may be implemented by any one of a plurality of 
configurations. For example, the processor may, in alternative embodiments, 
be defined by a collection of microprocessors configured for multiprocessing. 
In yet other embodiments, the functions provided by software components 
may be distributed across multiple computing devices (such as computers and 
peripheral devices) acting together as a single processing unit. Furthermore, 
one or more aspects of software components may be implemented in 
hardware, rather than software. For other alternative embodiments, the 
computer system may be implemented by data processing devices other than 
a general-purpose computer. 

Using the foregoing specification, the invention may be implemented as 
a machine (or system), process (or method), or article of manufacture by 
using standard programming and/or engineering techniques to produce 
programming software, firmware, hardware, or any combination thereof. 

Any resulting program(s), having computer-readable program code, 
may be embodied within one or more computer-usable media such as 
memory devices or transmitting devices, thereby making a computer program 
product or article of manufacture according to the invention. As such, the 
terms "article of manufacture" and "computer program product" as used herein 
are intended to encompass a computer program existent (permanently, 
temporarily, or transitorily) on any computer-usable medium such as on any 
memory device or in any transmitting device. 

Executing program code directly from one medium, storing program 
code onto a medium, copying the code from one medium to another medium, 
transmitting the code using a transmitting device, or other equivalent acts may 
involve the use of a memory or transmitting device which only embodies 
program code transitorily as a preliminary or final step in making, using, or 
selling the invention. 

Memory devices include, but are not limited to, fixed (hard) disk drives, 
floppy disks (or diskettes), optical disks, magnetic tape, semiconductor 
memories such as RAM, ROM, PROMs, etc. Transmitting devices include, but 
are not limited to, the Internet, intranets, electronic bulletin board and 
message/note exchanges, telephone/modem based network communication, 
hard-wired/cabled communication network, cellular communication, radio 
wave communication, satellite communication, and other stationary or mobile 
network systems/communication links. 

A machine embodying the invention may involve one or more 
processing systems including, but not limited to, CPU, memory/storage 
devices, communication links, communication/transmitting devices, servers, 
I/O devices, or any subcomponents or individual parts of one or more 
processing systems, including software, firmware, hardware, or any 
combination or sub-combination thereof, which embody the invention as set 
forth in the claims. 

The invention has been described with reference to particular 
embodiments. Modifications and alterations will occur to others upon reading 
and understanding this specification taken together with the drawings. The 
embodiments are but examples, and various alternatives, modifications, 
variations or improvements may be made by those skilled in the art from this 
teaching which are intended to be encompassed by the following claims. 